# Multimodal Instruction Following

## Dimple 7B
*rp-yu · Apache-2.0 · Image-to-Text · Transformers · English · 422 downloads · 3 likes*

Dimple is the first discrete diffusion multimodal large language model (DMLLM), combining autoregressive and diffusion training paradigms. Trained on the same dataset as LLaVA-NeXT, it outperforms LLaVA-NeXT-7B by 3.9%.

## Qwen2.5-VL-72B-Instruct GGUF
*Mungert · Other · Image-to-Text · English · 2,798 downloads · 5 likes*

Qwen2.5-VL-72B-Instruct is a 72B-parameter multimodal large model for vision-language tasks, capable of understanding images and generating text about them.

## SmolVLM2-500M-Video-Instruct MLX (8-bit, skip-vision)
*mlx-community · Apache-2.0 · Image-to-Text · Transformers · English · 51 downloads · 2 likes*

An MLX-format model converted from SmolVLM2-500M-Video-Instruct, supporting video-to-text tasks.

## Documentcogito
*Daemontatox · Apache-2.0 · Image-to-Text · Transformers · English · 73 downloads · 1 like*

A fine-tuned multimodal model based on unsloth/Llama-3.2-11B-Vision-Instruct, optimized for vision-language tasks with enhanced instruction-following, achieving a 2x training speedup via the Unsloth framework.

## Turkish LLaVA v0.1
*ytu-ce-cosmos · MIT · Image-to-Text · Other · 86 downloads · 10 likes*

A Turkish vision-language model specifically designed for multimodal visual instruction-following tasks, capable of processing both visual (image) and text inputs to understand and execute instructions provided in Turkish.

## Spydaz Web AI LLaVA
*LeroyDyer · Image-to-Text · Transformers · Multilingual · 30 downloads · 1 like*

LLaVA is an open-source multimodal chatbot, built on LLaMA/Vicuna and fine-tuned on GPT-generated multimodal instruction-following data; it serves as a chat/instruction-tuned multimodal counterpart to the base LLM.

## LLaVA 1.5 7B LLaRA D-inBC Aux-B VIMA-80k
*variante · Apache-2.0 · Transformers · 390 downloads · 2 likes*

LLaRA is an open-source vision-language robot policy model, fine-tuned from LLaVA-7b-v1.5 on instruction-following data and auxiliary datasets, intended primarily for robotics research.

## DenseConnector v1.5 8B
*HuanjinYao · Image-to-Text · Transformers · 17 downloads · 7 likes*

DenseConnector is an open-source chatbot, fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.

## LLaVA v1.6 Vicuna 7B
*liuhaotian · Image-to-Text · Transformers · 31.65k downloads · 123 likes*

LLaVA is an open-source multimodal chatbot, built by fine-tuning a large language model on multimodal instruction-following data.

## LLaVA v1.6 34B
*liuhaotian · Apache-2.0 · Image-to-Text · 9,033 downloads · 351 likes*

LLaVA is an open-source multimodal chatbot, fine-tuned from a large language model and supporting interaction with both images and text.

## LLaVA Int4
*emon-j · CC · Image-to-Text · Transformers · 40 downloads · 2 likes*

LLaVA is a multimodal large model that achieves general-purpose visual assistant capabilities by connecting a visual encoder to a large language model.

## Japanese Stable VLM
*stabilityai · Other · Image-to-Text · Transformers · Japanese · 122 downloads · 48 likes*

A vision-language instruction-following model capable of generating Japanese descriptions for input images, optionally conditioned on input text (e.g., questions).

## LLaVA v1.5 MLP2x 336px Pretrain Vicuna 7B v1.5
*liuhaotian · Image-to-Text · Transformers · 173 downloads · 17 likes*

LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.

## LLaVA v1.5 7B
*liuhaotian · Image-to-Text · Transformers · 1.4M downloads · 448 likes*

LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna and supporting image-text interaction.

## SpeechGPT 7B CM
*fnlp · Text-to-Audio · Transformers · 47 downloads · 7 likes*

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities, able to perceive and generate multimodal content and to interact via both speech and text.

## SpeechGPT 7B MA
*fnlp · Text-to-Audio · Transformers · 37 downloads · 5 likes*

SpeechGPT is a large language model with intrinsic cross-modal conversational abilities, able to perceive and generate multimodal content following human instructions.

## InstructBLIP Vicuna 7B 8-bit
*Mediocreatmybest · Image-to-Text · Transformers · 24 downloads · 3 likes*

InstructBLIP-Vicuna-7B is a vision-language model based on Vicuna-7B, supporting image-to-text tasks.

## LLaVA LLaMA-2 7B Chat Lightning LoRA (Preview)
*liuhaotian · Image-to-Text · Transformers · 251 downloads · 12 likes*

LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna on GPT-generated multimodal instruction-following data.

## LLaVA Lightning 7B Delta v1.1
*liuhaotian · Apache-2.0 · Image-to-Text · Transformers · 699 downloads · 21 likes*

LLaVA is an open-source chatbot based on LLaMA/Vicuna, fine-tuned on GPT-generated multimodal instruction-following data.

## LLaVA 7B Delta v0
*liuhaotian · Apache-2.0 · Image-to-Text · Transformers · 131 downloads · 17 likes*

LLaVA is an open-source chatbot based on LLaMA/Vicuna, fine-tuned on GPT-generated multimodal instruction-following data and supporting combined visual and language interaction.

## LLaVA 13B Delta v0
*liuhaotian · Apache-2.0 · Image-to-Text · Transformers · 352 downloads · 221 likes*

LLaVA is an open-source chatbot based on LLaMA/Vicuna, fine-tuned on GPT-generated multimodal instruction-following data; it is a Transformer-based autoregressive language model.

© 2025 AIbase